Machine recognition of lexical symbols

ABSTRACT

A raster scan covers areas containing major characters of an alphabet. When a character is recognized as being one which may have an associated diacritical mark, the scan is shifted to a separate area, the contents of which are recognized from among a group of such marks. The major-character recognition unit is disabled during scanning of the diacritical marks, and vice versa. The areas may be defined on a document by rows of rectangular boxes.

United States Patent
Rubenstein

[11] 3,710,321
[45] Jan. 9, 1973

[54] MACHINE RECOGNITION OF LEXICAL SYMBOLS

[75] Inventor: David A. Rubenstein, Rochester, Minn.

[73] Assignee: International Business Machines Corporation, Armonk, N.Y.

[22] Filed: Jan. 18, 1971

[21] Appl. No.: 106,971

[52] U.S. Cl.: 340/146.3, 340/146.3 Z
[51] Int. Cl.: G06k 9/12
[58] Field of Search: 340/146.3

[56] References Cited

UNITED STATES PATENTS
3,182,290  5/1965  Rabinow  340/146.3 A0
3,283,303  11/1966  Cerf  340/146.3 Z
3,460,091  8/1969  McCarthy  340/146.3 AH

Primary Examiner: Maynard R. Wilbur
Assistant Examiner: William W. Cochran
Attorney: Hanifin and Jancin and A. Michael Anglin

15 Claims, 5 Drawing Figures

BACKGROUND OF THE INVENTION

The present invention concerns systems and means for recognizing lexical symbols and is particularly directed toward the machine recognition of alphabets having auxiliary or diacritical marks.

The written form of many of the world's languages employs the basic Roman alphabet and a number of special signs or diacritical marks for varying the pronunciation or meaning of certain of the letters. The machine recognition of many of these languages requires that such marks be taken into account.

In conventional recognition systems, diacritical marks are frequently ignored by the machine. When they are recognized, they are considered to be an integral part of the character itself; this requires, for instance, that one recognition logic be designed for a character A, and a separate logic for the accented character Á. This approach also leads to a number of rejects and substituted characters, since the diacritical mark often is confused with a portion of the main character, thus changing its appearance to the recognition circuit. It also frequently occurs that a noise blob or smudge in the vicinity of the character is mistaken for a diacritical mark.

SUMMARY OF THE INVENTION

In the system of the present invention, a scanner traverses a document having a plurality of areas for containing patterns classifiable into a plurality of categories, such as characters of an alphabet. The areas are of two types: a first type contains the major symbols of the alphabet, while the second type contains the auxiliary symbols. The major symbols may represent any predetermined set of characters in a group or alphabet, such as Roman letters, numbers, punctuation marks or special symbols, or even a blank space. The set of auxiliary symbols may comprise, for instance, diacritical marks belonging to a specific language, special symbols, or any other set of marks which may be associable with particular ones of the major characters.

Recognition is enhanced according to the invention by making areas of the second type disjoint or nonoverlapping with respect to those of the first type. The areas are preferably defined by sets of preprinted guidelines or other boundaries on the input document. Where such boundaries are employed, a first plurality defines a row of central areas for receiving the major characters and a second plurality defines an adjacent row of substantially smaller areas for receiving the auxiliary symbols.

A first recognition unit then identifies the contents of the first or central areas as being certain major characters or symbols of the alphabet, while a second recognition means identifies the contents of the second or auxiliary areas with respect to at least one predefined set of auxiliary symbols associable with respective ones of the major characters. The second recognition unit is preferably enabled only when the associated major character is a member of a predetermined subset of the characters of the alphabet. Additionally, the scanner may be made to scan the central areas and to scan associated auxiliary areas only when the associated major character is identified as a member of the predetermined subset.

Accordingly, it is an object of the present invention to advance the state of the optical scanning, character recognition and related arts by providing an improved character recognition system and apparatus.

It is also an object of the invention to provide a recognition system which is extremely versatile and flexible in that it may be easily and inexpensively adapted to read symbols in a number of different languages without extensive changes.

It is another object to provide input documents for enhancing the capabilities of such a system.

Further objects and advantages of the invention, as well as modifications obvious to those skilled in the applicable arts, will become apparent from the following detailed description, taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of an optical character recognition system embodying the invention.

FIGS. 2A and 2B illustrate portions of input documents useful with the system of FIG. 1, and further show a scanning pattern according to the invention.

FIG. 3 is a schematic diagram of the recognition unit of FIG. 1.

FIG. 4 shows the auxiliary scan selectors of FIG. 1.

DETAILED DESCRIPTION

Referring more particularly to FIG. 1, there is shown generally a character recognition system in which a scanning beam generated by a cathode-ray tube (CRT) 101 is focused through an optical system 102 onto a document 200. A photomultiplier tube (PMT) or other photodetector 104 collects diffuse reflected light from the document and converts it into an electrical signal for a video detector 110, where it is digitized in both time and amplitude. The signal from detector 110 proceeds through line 1A to recognition unit 300 for analysis. Digital codes corresponding to the recognized characters then proceed on line 1B to a central processing unit (CPU) channel, or data processor, 130.

Channel 130 in turn transmits digital data on lines 1F-1J to format decoder 151 of control apparatus 150. Conventional decoder 151 provides signals on line 2G for controlling the mode of operation of recognition unit 300, as will be more fully described hereinafter. Decoder 151 also provides scan-control signals to conventional scan selectors 153. Selectors 153 in turn provide control signals to auxiliary scan selectors 400. Lines 4A-4K, 40 and 4R carry various scan-selection signals to beam control unit 160, which in turn provides deflection signals on lines 1M and 1N to CRT 101.

The conventional portions of the system of FIG. 1 are more fully described in commonly owned U.S. Pat. application Ser. No. 829,397, filed June 2, 1960, by D. L. Johnston and P. E. Nelson. The present invention, however, is also useful with recognition systems other than the particular example shown in FIG. 1.

FIG. 2A shows an enlarged portion of a document 200 having distinct rows of fields 210 for receiving handwritten characters. Each field 210 contains a first plurality of boundaries 221-224 defining a number of central areas 230 for receiving the major characters of the alphabet to be recognized. Each row 210 extends in a horizontal direction and the rows 210 are disposed vertically with respect to each other on document 200. As may be seen, boundaries 221-224 form a substantially rectangular box of convenient size. Associated with each central area 230 is at least one auxiliary area 240, defined by a second plurality of boundaries 251-254. Each auxiliary area 240 is associated with one central area 230, although each central area 230 may be associated with more than one auxiliary area 240. Where the language to be recognized contains both superior and inferior diacritical marks, areas 240 are located above and below areas 230, the areas 240 being separated from each other by areas 230. It should be noted that areas 240 are completely separate and disjoint, although the two types of areas 230 and 240 are located adjacent to one another. They may, in fact, be located contiguously, so that the boundaries 253 of the second plurality are common with the boundaries 221 and 223 of the first plurality.

Each area 230 may have boundaries 222 and 224 in common with other areas 230; similarly, each area 240 may have boundaries 252 and 254 in common with further ones of the areas 240. In accordance with conventional practice, boundaries 221-224 and 251-254 are preferably invisible to recognition unit 300. This effect may be accomplished by printing the boundaries in an ink which is invisible to photodetector 104, FIG. 1. It may also be accomplished by printing the boundaries as a series of small elements (such as dots) which give the visual impression of lines, but which are filtered out as "noise" by video detector 110 or by recognition unit 300. That is, the term "boundary," as used herein, is to be taken as one or more elements which have the visual effect of separating one area from another. Moreover, it may be preferable in some applications to form the areas 230 and/or 240 in other than rectangular shapes. Boundaries 221-224 and 251-254 may, for instance, define other types of parallelograms, such as rhomboids.
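By way of illustration only, the contiguous but disjoint arrangement of central and auxiliary areas described above may be modeled as in the following sketch. The names `Box` and `make_field`, and all dimensions, are hypothetical and do not appear in the specification; the sketch assumes rectangular areas with one auxiliary area above and one below each central area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y grows downward, as on a scanned page."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def overlaps(self, other: "Box") -> bool:
        # Boxes that merely share a boundary line are treated as disjoint.
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


def make_field(left, top, n_chars, char_w=10.0, char_h=14.0, aux_h=4.0):
    """Return one (upper_aux, central, lower_aux) triple per character position."""
    cells = []
    for i in range(n_chars):
        x = left + i * char_w
        upper = Box(x, top, char_w, aux_h)                      # superior marks
        central = Box(x, top + aux_h, char_w, char_h)           # major character
        lower = Box(x, top + aux_h + char_h, char_w, aux_h)     # inferior marks
        cells.append((upper, central, lower))
    return cells


cells = make_field(0.0, 0.0, n_chars=4)
upper, central, lower = cells[0]
print(upper.overlaps(central), lower.overlaps(central))
```

In this sketch the auxiliary boxes share a boundary with the central box yet never overlap it, and each auxiliary box is substantially smaller than its central box.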

FIG. 2B shows a row of letters 201-204 and associated diacritical marks 205, 206 upon a document 200 in which central areas 270 and auxiliary areas 280 are defined by a scan pattern 290 rather than by preprinted guidelines. Details of scan 290 will be discussed in connection with FIG. 4.

Referring now to FIG. 3, conventional preprocessor 310 of recognition unit 300 transmits signals corresponding to the presence or absence of predetermined features of an input character on lines 311-313. Preprocessor 310 may perform the conventional functions of pattern storage, registration, segmentation and feature extraction. Conventional recognition logic 320 processes the feature signals on line 311 to produce an identification code on line 321 which is indicative of the major characters contained in central areas 230 or 270. Line 321 also transmits the identifying code, via line 351, to a decoder 330, which is enabled by a signal on line 2G when format decoder 151 has detected a command from CPU channel 130 that the alphabet to be recognized may contain diacritical marks or other auxiliary symbols.

In the example to be described, the Roman letters "A," "E," "I," "O," and "U" comprise a first subset of the alphabet; this subset may have one of a predetermined group of superior diacritical marks located thereabove. A second subset, comprising the single letter "C," may have an inferior diacritical mark located therebelow. When one of the characters in the first subset has been recognized by logic 320, decoder 330 transmits a signal on line 331 for energizing recognition logic 340. Logic 340 may be relatively rudimentary in form, since it need recognize only those symbols contained in the set of the accents acute, grave and circumflex, the diaeresis (or double dot), and a blank space. A code corresponding to the recognized symbol of this set is then transmitted to output unit 350 on line 341. Similarly, the single letter "C" forms another subset of the alphabet, since it may have a cedilla located in an auxiliary space therebelow. For this second subset, line 332 from decoder 330 provides a signal for enabling diacritical recognition logic 360. Logic 360 may be even simpler than logic 340, since it need only differentiate between the cedilla and a blank space. Its identification code is transmitted on line 361 to output unit 350. Decoder 330 may also provide a signal on line 333 whenever a character in either of the subsets is recognized. This signal disables recognition logic 320 for either of the two subsets (or, equivalently, enables it under the opposite condition), so that logic 320 cannot confuse one of the diacritical marks with any of the major characters.
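The enabling behavior of decoder 330 may be sketched in software as follows. This is only an illustrative model under stated assumptions: the function name `decode`, the dictionary keys, and the representation of lines 331, 332 and 333 as Boolean values are hypothetical, and the first subset is taken here to be the five vowels.

```python
# Hypothetical subsets for the example: vowels may carry a superior mark,
# and "C" may carry an inferior mark (the cedilla).
SUPERIOR_SUBSET = frozenset("AEIOU")
INFERIOR_SUBSET = frozenset("C")

# Symbol sets that logics 340 and 360 would need to distinguish.
SUPERIOR_MARKS = ("acute", "grave", "circumflex", "diaeresis", "blank")
INFERIOR_MARKS = ("cedilla", "blank")


def decode(major_char, diacritics_enabled=True):
    """Model decoder 330: decide which recognition logic is active next.

    diacritics_enabled stands in for the signal on line 2G.
    """
    enable_340 = diacritics_enabled and major_char in SUPERIOR_SUBSET  # line 331
    enable_360 = diacritics_enabled and major_char in INFERIOR_SUBSET  # line 332
    return {
        "enable_340": enable_340,
        "enable_360": enable_360,
        # Line 333: major-character logic 320 is disabled while a mark is read.
        "inhibit_320": enable_340 or enable_360,
    }


for ch in "ECB":
    print(ch, decode(ch))
```

Because each auxiliary logic only ever chooses among a handful of marks, and logic 320 is inhibited while it does so, a mark cannot be mistaken for a major character in this model.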

Output unit 350 may be a conventional buffer storage for holding identification codes on any of the lines 321, 341 and 361, and for transmitting these codes to CPU channel 130 over line 1B. If, on the other hand, it is desired that a first identification code be transmitted for a letter not having a certain diacritical mark, and a different code be transmitted for the same letter with a specific diacritical mark, then output unit 350 may include a code modifier or translator for modifying the code on line 321 in accordance with a code on line 341 or 361. Units for performing this function are also well known in the art.
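The two output modes just described may be illustrated by the following sketch. It uses Unicode combining characters and the standard `unicodedata.normalize` function purely as a modern stand-in for the code modifier; the specification does not name any character code, and the function names `sequential_codes` and `modified_code` are hypothetical.

```python
import unicodedata

# Assumed mapping from recognized mark names to Unicode combining characters.
COMBINING = {
    "acute": "\u0301",
    "grave": "\u0300",
    "circumflex": "\u0302",
    "diaeresis": "\u0308",
    "cedilla": "\u0327",
    "blank": "",
}


def sequential_codes(major, mark=None):
    """Buffer mode: send the major code, then the mark code if one was read.

    mark is None when the auxiliary logic was never enabled.
    """
    return [major] if mark is None else [major, mark]


def modified_code(major, mark="blank"):
    """Translator mode: send a single code for the letter-plus-mark combination."""
    return unicodedata.normalize("NFC", major + COMBINING[mark])


print(sequential_codes("E", "acute"))
print(modified_code("E", "acute"), modified_code("C", "cedilla"))
```

The first function corresponds to transmitting two codes in sequence, the second to transmitting one modified code; which is preferable would depend on what the receiving data processor expects.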

FIG. 4 shows the auxiliary scan selectors 400 for executing a scanning path such as that shown at 290, FIG. 2B. Scan pattern 290 is also preferably employed with a document having preprinted guidelines such as those shown in FIG. 2A. In an initial portion 291 of pattern 290, scan selectors 153 cause CRT 101 to execute a vertical raster scan over the central areas 270. A conventional signal on line 473, passed through OR gate 474, enables raster-scan generator 470 to produce signals on line 4G to control this scan. (Line 4G is included in the cable 4A-4K shown in FIG. 1.) The conditions under which conventional signals 473 may be generated are shown in more detail in the aforementioned patent application Ser. No. 829,397. Raster portion 291 continues through the characters 201 and 202, FIG. 2B.

When character 202 is recognized as being a member of the subset of letters which may contain a superior diacritical mark, however, the previously mentioned signal on line 331 is transmitted on line 3K to seek generator 480 to produce a signal on line 40 causing beam control 160 to move the scanning beam back and upward along line 292 to the upper auxiliary area 280 associated with character 202. When a signal on line 481 indicates that scan line 292 has reached its destination, input line 475 causes raster generator 470 to produce signals on line 4G to move the scanning beam in a reduced-size raster 293. The seek-end signal on line 481 is also transmitted to an enabling input 491 of a scan counter 492. Then, when reduced raster 293 reaches the end of auxiliary area 280 after a predetermined number of scans, a signal on line 493 causes seek generator 490 to produce a signal on line 4R which in turn causes beam control 160 to move the scanning beam in a path 294 to the central area 270 for the next major character 203. When the beam has reached a predetermined position in central area 270, a signal on line 494 causes raster generator 470 through OR gate 474 to again produce a full-size raster scan 295.

When seek generator 480 receives a signal on line 3L at the completion of scanning of the character 203, a similar sequence ensues. This time, however, seek scan 296 leads back and downward to the lower auxiliary area 280 for character 203, since it is a member of the second subset of the alphabet. The seek-end signal on line 481 then initiates a reduced raster scan 297 over the lower auxiliary area until generator 490 receives a signal on line 493. At this point, generator 490 produces a scanning path 298 to the central area 270 for the next character 204. A seek-end signal on line 494 then energizes raster generator 470 as previously described, and the scan cycle repeats itself.
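The sequence of beam movements described above may be summarized, under simplifying assumptions, by the following sketch. It takes the already recognized major characters of a row as input and lists the resulting scan steps; the function name `scan_plan`, the step labels, and the example row are hypothetical, and the subsets are taken as the vowels and the letter C for illustration.

```python
# Hypothetical subsets used only for this illustration.
SUPERIOR_SUBSET = frozenset("AEIOU")  # mark, if any, lies in the upper auxiliary area
INFERIOR_SUBSET = frozenset("C")      # mark, if any, lies in the lower auxiliary area


def scan_plan(recognized_row):
    """List the beam movements for one row of recognized major characters."""
    steps = []
    last = len(recognized_row) - 1
    for i, ch in enumerate(recognized_row):
        steps.append(("full_raster", i))            # raster over the central area
        if ch in SUPERIOR_SUBSET:
            target = "upper"
        elif ch in INFERIOR_SUBSET:
            target = "lower"
        else:
            continue                                # no auxiliary area is scanned
        steps.append(("seek", i, target))           # back and up, or back and down
        steps.append(("reduced_raster", i, target)) # small raster over the mark area
        if i < last:
            steps.append(("seek_return", i + 1))    # on to the next central area
    return steps


for step in scan_plan("BECD"):
    print(step)
```

With the row "BECD" the second and third characters play the roles of characters 202 and 203: the former triggers a seek to the upper auxiliary area, the latter a seek to the lower one, and the remaining characters receive only the full raster.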

In summary, auxiliary scan selectors 400 cause the scanning beam to traverse the row of central areas 270 on document 200. Whenever recognition unit 300 identifies a character belonging to one or more groups or subsets of the alphabet which may contain diacritical marks, signals on line 3K or 3L cause scan control 150 to interrupt its normal sequence and to scan the appropriate auxiliary areas 280 for the presence of a mark. Within recognition unit 300, the diacritical logics 340 and 360 are inhibited during scanning of the central areas 270, while logic 320 is inhibited during the scanning of the auxiliary areas 280; in this way, no confusion can result between the set of major characters and the set of diacritical marks or other auxiliary symbols. The scan pattern 290 conserves total scanning time, since only those auxiliary areas which might possibly contain a diacritical mark are scanned.

Other types of scan patterns for achieving similar results may also be visualized. A scanning beam may, for instance, traverse the entire row of central areas while the recognition unit 300 records the positions of all major characters in the row which may have a diacritical mark associated therewith. The scanning beam may then return to the beginning of the row and scan only those auxiliary areas 280 corresponding to the major characters whose positions have been recorded. It would also be possible to extend the concepts of the above described scan pattern to other types of scanners, such as linear-array scanners (not shown). Other variations within the scope and spirit of the invention will also suggest themselves to those skilled in the applicable arts.
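The alternative pattern mentioned above, in which the whole row is traversed first and the recorded auxiliary areas are visited afterward, may be sketched as follows. As before, the function name `two_pass_plan`, the step labels, and the choice of subsets are assumptions made only for illustration.

```python
# Hypothetical subsets used only for this illustration.
SUPERIOR_SUBSET = frozenset("AEIOU")
INFERIOR_SUBSET = frozenset("C")


def two_pass_plan(recognized_row):
    """First pass: scan every central area and record candidate positions.
    Second pass: scan only the auxiliary areas at the recorded positions."""
    first_pass = []
    recorded = []
    for i, ch in enumerate(recognized_row):
        first_pass.append(("full_raster", i))
        if ch in SUPERIOR_SUBSET:
            recorded.append((i, "upper"))
        elif ch in INFERIOR_SUBSET:
            recorded.append((i, "lower"))
    second_pass = [("reduced_raster", i, where) for i, where in recorded]
    return first_pass + second_pass


for step in two_pass_plan("BECD"):
    print(step)
```

Either ordering visits the same auxiliary areas; the two-pass form trades the repeated back-and-forth seeks for a single return to the beginning of the row.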

Having described a preferred embodiment thereof, I claim as my invention:

1. A system for recognizing lexical symbols, comprising:

means for scanning a document having a plurality of areas;

first recognition means for identifying the contents of a first of said areas as being a major symbol representing one character of an alphabet;

second recognition means for identifying the contents of a second of said areas with respect to a set of auxiliary symbols associable with particular ones of said characters, said second area being disjoint from said first area;

means responsive to said first recognition means for enabling said second recognition means when said one character is a member of a predetermined subset of said alphabet; and

output means responsive to both said first and said second recognition means for transmitting to a utilization means a first code representing said one character, and for selectively transmitting to said utilization device a second code when said second recognition means has been enabled.

2. The system of claim 1, wherein said second area is adjacent said first area.

3. The system of claim 2, wherein said set of auxiliary symbols is a predetermined group of diacritical marks for characters in said predetermined subset.

4. The system of claim 1, further comprising third recognition means for identifying the contents of a third of said areas with respect to a further set of auxiliary symbols associable with particular ones of said characters; and wherein said enabling means is further responsive to said first recognition means for enabling said third recognition means when said one character is a member of a further predetermined subset of said alphabet.

5. The system of claim 4, wherein said second and third areas are adjacent said first area, and are separated from each other by said first area.

6. The system of claim 1, wherein said scanning means is responsive to said first recognition means for scanning said second area only when said one character is a member of said predetermined subset.

7. The system of claim 1, wherein said output means is operative to transmit both said first and second codes sequentially to said utilization device.

8. The system of claim 7, wherein said first code represents an unmodified form of said one character, and wherein said second code represents one of said auxiliary symbols.

9. The system of claim 1, wherein said second code represents a modified form of said one character.

10. The system of claim 9, wherein said modified form represents the combination of said one character and one of said auxiliary symbols associable therewith.

11. A system for recognizing a plurality of input patterns, comprising:

means for executing a scan in a plurality of central areas of a field;

first recognition means for classifying patterns in said central areas into respective ones of a first plurality of categories;

means for detecting those of said patterns belonging to a predetermined group in said first plurality;

means responsive to said detecting means for shifting said scan to a plurality of auxiliary areas of said field corresponding to those of said central areas containing patterns belonging to said predetermined group;

second recognition means for classifying the contents of said auxiliary areas into respective ones of a second plurality of categories; and

means responsive to said detecting means for enabling said second recognition means during said shifted scan, wherein said enabling means is further responsive to said detecting means for inhibiting said first recognition means during said shifted scan.

12. The system of claim 11, wherein said areas are bounded by a plurality of lines preprinted on said field.

13. The system of claim 11, further comprising means responsive to said shifting means for returning said scan from said auxiliary areas to said central areas.

14. The system of claim 13, wherein said scan in said central areas is a raster scan.

15. The system of claim 14, wherein said shifted scan is a raster scan across said auxiliary areas.